LLM benchmarks: run weekly LLM benchmarks from website-managed models #5324

Open
bradleyshep wants to merge 28 commits into master from bradley/fix-validate-goldens-ci

@bradleyshep

Contributor

Note 1: this requires a website PR to merge

Note 2:

I was able to run all workflow smoke tests successfully, including golden validation and dry-run benchmarks, except for the C# dry-run benchmark path. C# golden validation passes, but the C# benchmark dry run still fails intermittently/consistently on the runner despite several attempts to align its build/publish setup with the known-good smoketest path.

gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true

Description of Changes

This updates the LLM benchmark automation and runner plumbing.

  • Move periodic LLM benchmark and golden validation workflows from daily/nightly to weekly Monday UTC runs.
  • Add manual workflow inputs for benchmark smoke runs:
    • model set: website-managed, local defaults, or explicit models
    • languages, modes, categories, tasks
    • dry-run mode
  • Build the local TypeScript SDK before TypeScript benchmark/golden validation runs.
  • Add support for fetching active/available benchmark models from the website API via --model-source remote.
  • Keep explicit --models ... working for manual/local overrides.
  • Add OpenRouter preflight checks before benchmark execution:
    • checks key/account credits when available
    • probes the selected model when credit balance cannot be checked
    • supports OPENROUTER_ALLOW_UNCHECKED_CREDITS=1 escape hatch
    • supports OPENROUTER_MIN_CREDITS / LLM_MIN_CREDITS
  • Force scheduled benchmark workflow runs through OpenRouter with LLM_VENDOR=openrouter, while preserving direct OpenAI support for local/manual use.
  • Improve benchmark publishing isolation:
    • isolated SpacetimeDB CLI root per publish
    • serialized C# benchmark publish concurrency
    • local NuGet package references for generated C# benchmark projects
    • Windows/PATH handling for TypeScript pnpm
  • Update default benchmark model routes to current model names/ids.
  • Update TypeScript golden answers for current SDK shape.
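The OpenRouter preflight behavior listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's actual code: the names `Preflight`, `min_credits`, and `preflight` are hypothetical, and both the precedence of `OPENROUTER_MIN_CREDITS` over `LLM_MIN_CREDITS` and the ordering of the escape hatch before the model probe are guesses.

```rust
/// Outcome of the (hypothetical) preflight check.
#[derive(Debug, PartialEq)]
pub enum Preflight {
    Proceed,
    Abort(String),
}

/// Resolve the minimum-credit threshold from the two env var values.
/// Assumption: OPENROUTER_MIN_CREDITS wins over LLM_MIN_CREDITS, and a
/// missing or unparsable value means "no minimum" (0.0).
pub fn min_credits(openrouter_min: Option<&str>, llm_min: Option<&str>) -> f64 {
    openrouter_min
        .or(llm_min)
        .and_then(|raw| raw.trim().parse::<f64>().ok())
        .unwrap_or(0.0)
}

/// Decide whether a benchmark run may start.
/// `credits` is `None` when the balance cannot be checked; in that case
/// either the OPENROUTER_ALLOW_UNCHECKED_CREDITS escape hatch or a
/// successful probe of the selected model lets the run proceed.
pub fn preflight(
    credits: Option<f64>,
    min_credits: f64,
    allow_unchecked: bool,
    probe_model: impl FnOnce() -> bool,
) -> Preflight {
    match credits {
        Some(balance) if balance >= min_credits => Preflight::Proceed,
        Some(balance) => Preflight::Abort(format!(
            "credit balance {} is below the minimum {}",
            balance, min_credits
        )),
        None => {
            // The probe only runs when the escape hatch is not set.
            if allow_unchecked || probe_model() {
                Preflight::Proceed
            } else {
                Preflight::Abort(
                    "credit balance unavailable and model probe failed".to_string(),
                )
            }
        }
    }
}
```

Taking the probe as a closure keeps the decision logic testable without network access; a real implementation would presumably issue a small request to the selected model inside it.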

API and ABI breaking changes

None.

This adds benchmark-runner/workflow behavior and CLI options, but does not change SpacetimeDB runtime API or ABI.

Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub workflows, but the risk is moderate because they touch CI execution paths, local SDK build assumptions, website-managed model resolution, OpenRouter routing, and generated module publish behavior across Rust, C#, and TypeScript.

The most sensitive pieces are:

  • GitHub Actions workflow dispatch/manual input behavior.
  • Remote model registry parsing from the website.
  • C# benchmark publish behavior on the self-hosted runner.
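For reviewers focusing on model resolution, here is a minimal sketch of how a route string such as `openrouter:openai/gpt-5.4-mini` might be split into vendor and model. The `ModelRoute` type, the function names, and the comma-separated input format are assumptions made for illustration; the real parser in the runner may differ.

```rust
/// A parsed `vendor:model` route (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct ModelRoute {
    pub vendor: String,
    pub model: String,
}

/// Split a route at the first ':' so that model ids containing '/'
/// (for example "openai/gpt-5.4-mini") stay intact.
pub fn parse_model_route(raw: &str) -> Option<ModelRoute> {
    let (vendor, model) = raw.trim().split_once(':')?;
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        vendor: vendor.to_ascii_lowercase(),
        model: model.to_string(),
    })
}

/// Parse a comma-separated list of routes, skipping malformed entries.
/// Assumption: the `models` input is comma-separated like `languages`.
pub fn parse_models_input(input: &str) -> Vec<ModelRoute> {
    input.split(',').filter_map(parse_model_route).collect()
}
```

Skipping malformed entries silently is only one option; a stricter runner might prefer to fail the whole run so that a typo in a manual dispatch is noticed immediately.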

Testing

  • cargo check -p xtask-llm-benchmark --bin llm_benchmark
  • cargo test -p xtask-llm-benchmark --bin llm_benchmark
  • cargo test -p xtask-llm-benchmark parses_active_available_model_routes
  • Manual GitHub Actions golden validation smoke runs for Rust, C#, and TypeScript.
  • Run a dry-run periodic benchmark workflow from this branch with one explicit OpenRouter model, one task, and all languages.
  • Run a website-dispatched dry-run benchmark and verify it sends model_set=explicit plus selected model/task inputs.

# Description of Changes

Sets `DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1` only on the benchmark
harness command that publishes generated C# modules. This keeps dotnet
startup out of localized DateTime/TimeZoneInfo formatting on the CI
runner, which was crashing before generated C# module publish could run.
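A minimal sketch of scoping the variable to a single child process, assuming the harness builds its publish command with `std::process::Command`. The helper name `csharp_publish_command` and its parameters are hypothetical and not taken from the diff.

```rust
use std::process::Command;

/// Build the command that publishes a generated C# module, setting the
/// globalization-invariant flag on this child process only. The parent
/// environment and all other harness commands are left untouched.
pub fn csharp_publish_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args);
    // Applies to this child only; nothing is exported process-wide.
    cmd.env("DOTNET_SYSTEM_GLOBALIZATION_INVARIANT", "1");
    cmd
}
```

Setting the variable through `Command::env` rather than in the workflow YAML keeps the change local to C# publishes, so Rust and TypeScript benchmark steps run with an unchanged environment.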

Stacked on #5324.

```bash
gh workflow run llm-benchmark-periodic.yml \
  --repo ClockworkLabs/SpacetimeDB \
  --ref bot/debug-llm-csharp-publish \
  -f model_set=explicit \
  -f models="openrouter:openai/gpt-5.4-mini" \
  -f languages=rust,csharp,typescript \
  -f modes=guidelines \
  -f tasks=t_000_empty_reducers \
  -f dry_run=true
```

# API and ABI breaking changes

None.

# Expected complexity level and risk

1. CI benchmark harness environment fix.

# Testing

- [x] `cargo fmt --all`
- [x] `cargo check --manifest-path tools/xtask-llm-benchmark/Cargo.toml`
- [x] `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/llm-benchmark-periodic.yml"); YAML.load_file(".github/workflows/llm-benchmark-validate-goldens.yml")'`
- [x] `git diff --check`

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
false
fn signal_killed_by(_status: &std::process::ExitStatus) -> Option<i32> {
    None
}

Contributor

This whole transient thing is a little sus, but it's not a regression, so it's fine.


/// Context limits for models accessed via OpenRouter.
/// Uses the same limits as direct clients where known,
/// falls back to a conservative default.

Contributor

Seems a little weird to have these in the code, rather than pulling them from OpenRouter, but it's not a regression so I'll let it go.

Contributor Author

Agreed. Initially I was using different providers, which had different context limits that were not reachable by API. I think the cleaner long-term approach is to use only OpenRouter and get the context limits from there... Probably even just get rid of the direct OpenAI/other providers.
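To make the trade-off in this thread concrete, here is a rough sketch of a lookup with a conservative fallback that does not care where its table comes from. Everything here is hypothetical: the function name, the `DEFAULT_CONTEXT_LIMIT` value, and the idea of passing the table in as a slice are illustration only, not the code under review.

```rust
/// Conservative fallback used when a model is not in the table.
/// The value is an illustrative placeholder, not the runner's default.
pub const DEFAULT_CONTEXT_LIMIT: usize = 32_000;

/// Look up a model's context limit in `known`, falling back to the
/// conservative default. `known` could be a hard-coded table today and
/// a list fetched from OpenRouter later without changing callers.
pub fn context_limit(known: &[(&str, usize)], model: &str) -> usize {
    known
        .iter()
        .find(|(name, _)| *name == model)
        .map(|(_, limit)| *limit)
        .unwrap_or(DEFAULT_CONTEXT_LIMIT)
}
```

With the table passed in as data, switching from in-code limits to provider-supplied ones would be a change at the call site rather than in the lookup itself.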

@cloutiertyler left a comment

Contributor

Seems generally fine and is low risk, so once we get CI to pass, I'm good to merge.
